ABSTRACT*

Code-excited linear prediction (CELP) coding is
an efficient technique for compressing speech
sequences. Communications-quality speech can
be obtained at bit rates below 8 Kb/s. However,
relatively large coding delays are necessary to
buffer the input speech in order to perform the
LPC analysis. In this paper we introduce a
low-delay 8 Kb/s CELP coder in which the
short-term predictor is based on past
synthesized speech. A new distortion measure
that improves the tracking of the formant
filter is discussed. Formal listening tests
showed that the performance of the
backward-adaptive coder is almost as good as
that of the conventional CELP coder.

INTRODUCTION 

Recent advances in linear prediction coding 
have made it possible to achieve communica- 
tions quality of speech at bit rates below 8 Kb/s. 
Practical real-time implementations are possi- 
ble due to efficient algorithms based on Code- 
Excited Linear Prediction (CELP) [1]. In these 
coders, the residual is vector quantized using an 
analysis-by-synthesis search procedure. The ex- 
citation vector (or codevector) is chosen from a 
large codebook. All the codevectors are passed 
through the synthesis filters and compared with 
the original speech vector. The index of the 
codevector that minimizes an objective distor- 
tion measure between original and quantized 
speech is sent through the channel. The parame- 
ters of the synthesis filters (gain, long- and short- 
term LPC coefficients, and pitch lag) are sent 
to the decoder as side information. Gain, pitch
lag, and long-term predictor coefficients can be
optimized using closed-loop procedures.
Ideally, the formant filter could also be
optimized in a closed-loop procedure, but this
would lead to a mathematically intractable set
of non-linear equations [2]. Therefore, the
short-term predictor coefficients are
calculated using an open-loop
solution based on the original speech. In order to 
obtain a reliable linear prediction filter, approx- 
imately 20 ms of speech samples are buffered. 
The one-way delay of the coder, although highly 
dependent on real-time implementations, could 
be as high as 60 ms. The delay could be re- 
duced by using only past speech (no buffering). 
However, the linear prediction analysis would be 
unreliable, resulting in poor speech quality. This 
problem can be overcome by updating the LPC 
parameters at a higher rate. This would require 
more bits/sample, thereby increasing either the 
total bit rate or the distortion. 

In this paper, we present an 8 Kb/s CELP
coder in which the short-term linear prediction
parameters are updated in a backward-adaptive
manner. That is, the linear prediction analysis
is performed on past synthesized speech, which
is available, assuming no transmission errors, at
both ends of the channel. Therefore, the LPC
parameters are not sent through the channel and
can be updated at high rates, even on a
sample-by-sample basis. Speech quality is nearly
as good as in the conventional (or
forward-adaptive) coder. A new distortion
measure is introduced to prevent predictor
mistracking.

A diagram of the encoder is shown in figure
1. The synthesis filters are separated into their
zero-input and zero-state components. The min- 
imum pitch is constrained to be always greater 
than the block size. Therefore the transfer 
function of the zero-state pitch synthesis filter 
is unity. The weighting filter is moved from 
its original location (filtering the error between 
original and synthesized speech) to both of its 
input branches. 
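
The zero-input / zero-state split relies on the
linearity of the synthesis filter and can be
checked numerically. The sketch below is only an
illustration: the second-order predictor, the
filter memory, and the excitation values are
hypothetical, and the sign convention
A(z) = 1 - sum_k a_k z^-k is an assumption.

```python
def all_pole_filter(a, x, state):
    """Run 1/A(z), with A(z) = 1 - sum_k a[k-1] z^-k (assumed convention).

    `state` holds past outputs, most recent first.
    Returns the output block and the updated state.
    """
    state = list(state)
    y = []
    for xn in x:
        yn = xn + sum(ak * sk for ak, sk in zip(a, state))
        y.append(yn)
        state = [yn] + state[:-1]
    return y, state

a = [1.2, -0.5]                       # hypothetical 2nd-order predictor
memory = [0.3, -0.1]                  # memory left by the previous block
excitation = [1.0, 0.0, -0.5, 0.25]   # hypothetical scaled codevector

# Full response: excitation applied to a filter with memory.
full, _ = all_pole_filter(a, excitation, memory)

# Zero-input response: filter ringing only, no excitation.
zero_input, _ = all_pole_filter(a, [0.0] * len(excitation), memory)

# Zero-state response: excitation applied to a cleared filter.
zero_state, _ = all_pole_filter(a, excitation, [0.0] * len(a))

recombined = [zi + zs for zi, zs in zip(zero_input, zero_state)]
```

Because the zero-input response does not depend
on the codevector, it can be computed once per
block and removed from the target, leaving only
the zero-state response to be evaluated for each
codebook entry.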



Figure 1 CELP Encoder 


CODER DESIGN 

Backward adaptation 

In pure backward-adaptive 16 Kb/s CELP
coders [3] [4], only the excitation vector index
is sent through the channel. The rest of the
parameters are computed in a backward-adaptive
mode. The vector dimension (block of samples)
is 4 to 5 samples and the one-way delay is below
2 ms. Unfortunately, as the bit rate decreases
the quantization effects become more
pronounced, leading to poor filter tracking and
to further distortion of the original speech. As
a result, in the backward-adaptive CELP
(BA-CELP) coder only the short-term predictor
is computed in a backward-adaptive manner.
Pitch filter parameters and gain are optimized
in closed-loop procedures and sent through the
channel as side information. The three-tap pitch
filter plays an important role, not only in the
fine structure but also in the shape of the
spectrum of the reconstructed speech.

In our BA-CELP coder, all the parameters
are updated at the end of each block of
samples. The bit allocation scheme is shown in
table 1. The vector dimension is 26 samples and
the sampling rate is 8 kHz. Consequently, the
total delay (typically 4 times the vector
dimension) is around 13 ms. The short-term LPC
analysis is performed using the modified
covariance method. The length of the frame is
four times the vector dimension. Note that the
frames are highly overlapped and relatively
short. This is necessary to improve the tracking
of the adaptive filter, especially when rapid
changes of the spectrum occur. The
autocorrelation method proved to be ineffective
in this application. This is because the
windowing process weights the error in the
middle of the frames higher than at the edges of
the frames. As a result, the spectral match is
poor in regions of rapid spectrum variations.


Parameter          bits/block   Kbits/sec

Formant filter          0          0.0
Pitch filter            5          1.5
Pitch lag               7          2.2
Gain                    5          1.5
Excitation vec.         9          2.8
Total                  26          8.0


Table 1 Bit allocation and corresponding
bit rate. The vector dimension is 26
samples and the sampling rate is 8 kHz.
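
The figures in table 1 and the quoted delay
follow directly from the block size and the
sampling rate. The short sketch below redoes that
arithmetic; the variable names are illustrative
and the factor of 4 is the rule of thumb stated
in the text.

```python
SAMPLE_RATE_HZ = 8000
VECTOR_DIM = 26  # samples per block

bits_per_block = {
    "Formant filter": 0,
    "Pitch filter": 5,
    "Pitch lag": 7,
    "Gain": 5,
    "Excitation vec.": 9,
}

# Number of blocks transmitted per second.
blocks_per_second = SAMPLE_RATE_HZ / VECTOR_DIM

# Bit rate of each parameter in Kbits/sec.
rates_kbps = {
    name: bits * blocks_per_second / 1000.0
    for name, bits in bits_per_block.items()
}

total_bits = sum(bits_per_block.values())
total_kbps = total_bits * blocks_per_second / 1000.0

# Rule of thumb from the text: delay of about 4 vector dimensions.
delay_ms = 4 * VECTOR_DIM / SAMPLE_RATE_HZ * 1000.0

for name, rate in rates_kbps.items():
    print(f"{name:16s} {bits_per_block[name]:2d} bits  {rate:.1f} Kb/s")
print(f"Total            {total_bits:2d} bits  {total_kbps:.1f} Kb/s")
print(f"Delay            {delay_ms:.1f} ms")
```

Since one bit per sample at 8 kHz is exactly
8 Kb/s, a 26-bit budget for a 26-sample block
gives the target rate with no bits spent on the
formant filter.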


Perceptual weighting filter 

Psychoacoustical studies show that the human
auditory system can tolerate more errors in
the formants of the speech spectrum than in the
valleys. Therefore, we can obtain a more
subjectively meaningful distortion measure by
weighting the spectrum of the error. Regions of
the error spectrum that correspond to formants
in the speech spectrum are de-emphasized and
regions corresponding to the valleys between
formants are emphasized. Using a weighting
function W(z) we can
write,

$$ \epsilon_w = \frac{1}{2\pi} \int_0^{2\pi} \left| S(e^{j\omega}) - \hat{S}(e^{j\omega}) \right|^2 \left| W(e^{j\omega}) \right|^2 d\omega \qquad (1) $$

where $\epsilon_w$ is the noise-weighted
mean-squared error (NWMSE). A general weighting
filter is discussed in [3] and [5].

$$ W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)}, \qquad 0 < \gamma_2 < \gamma_1 < 1 \qquad (2) $$

A good choice for the parameters is
$\gamma_1 = 0.9$ and $\gamma_2 = 0.4$. Note that
in conventional CELP coders only one LPC
analysis is necessary for the synthesis and
weighting filters. Conversely, the
backward-adaptive approach requires two
separate LPC analyses: one based on
reconstructed speech for the synthesis filter
and the other based on the original speech for
the weighting filter.
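
The weighting filter of equation 2 can be built
from a single set of LPC coefficients by scaling
the k-th coefficient by a power of gamma. The
sketch below assumes the convention
A(z) = 1 - sum_k a_k z^-k, so that
A(z/gamma) = 1 - sum_k a_k gamma^k z^-k; the
function names and the example coefficients are
hypothetical.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): the k-th tap is scaled by gamma**k."""
    return [ak * gamma ** (k + 1) for k, ak in enumerate(a)]

def weighting_filter(a, x, gamma1=0.9, gamma2=0.4):
    """Filter x through W(z) = A(z/gamma1) / A(z/gamma2), zero initial state."""
    num = bandwidth_expand(a, gamma1)  # zeros, from A(z/gamma1)
    den = bandwidth_expand(a, gamma2)  # poles, from 1/A(z/gamma2)
    p = len(a)
    x_hist = [0.0] * p  # past inputs, most recent first
    y_hist = [0.0] * p  # past outputs, most recent first
    y = []
    for xn in x:
        fir = xn - sum(b * xh for b, xh in zip(num, x_hist))
        yn = fir + sum(c * yh for c, yh in zip(den, y_hist))
        y.append(yn)
        x_hist = [xn] + x_hist[:-1]
        y_hist = [yn] + y_hist[:-1]
    return y

a = [1.2, -0.5]  # hypothetical predictor from the original speech
impulse = weighting_filter(a, [1.0, 0.0, 0.0])

# With equal gammas the zeros cancel the poles and W(z) = 1.
x = [0.5, -1.0, 0.25, 2.0]
identity = weighting_filter(a, x, gamma1=0.7, gamma2=0.7)
```

In the backward-adaptive coder the coefficients
passed to such a routine would come from the
analysis of the original speech, not from the
synthesis filter.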

Mixed distortion 

Further improvement in filter tracking can be 
achieved by using a mixed distortion measure in 
the excitation vector search procedure. The pro- 
posed mixed distortion combines mean-squared 
error with a subjectively meaningful LPC dis- 
tortion measure. 

Log-likelihood ratio distortion measure.

In linear prediction theory, the minimum
residual energy for a particular speech frame is
given by

$$ \alpha = r_0 - \mathbf{a}^T \mathbf{r} \qquad (3) $$

where $\mathbf{r}$ is the correlation vector,
$r_0$ is the energy of the segment and
$\mathbf{a}$ is the optimum LPC coefficient
vector. If the same frame is passed through a
non-optimum inverse filter with coefficient
vector $\hat{\mathbf{a}}$, then the residual
energy $\beta$ cannot be smaller than $\alpha$,

$$ \beta = r_0 - 2 \hat{\mathbf{a}}^T \mathbf{r} + \hat{\mathbf{a}}^T \mathbf{R} \hat{\mathbf{a}} \geq \alpha \qquad (4) $$

where $\mathbf{R}$ is the correlation matrix.
Equality holds when
$\hat{\mathbf{a}} = \mathbf{a}$. The
log-likelihood ratio (LLR) distortion measure is
defined as

$$ d_{LLR}\left(A(z), \hat{A}(z)\right) = \log\left(\frac{\beta}{\alpha}\right) \qquad (5) $$

which is equivalent to the difference of the
logarithmic prediction gains. The LLR distortion
measure has proved to be subjectively
meaningful [6] [7].

Figure 2 shows the filtering operation. The
two input sequences are $s(n)$ and
$\hat{s}(n)$. The corresponding $p$-th order
inverse filters are $A(z)$ and $\hat{A}(z)$.
When one of the input sequences, say $s(n)$,
is passed through the filters, the resulting
residual energies are $\alpha$ and $\beta$. A
different distortion would be obtained by using
$\hat{s}(n)$ instead of $s(n)$ as the input
sequence. This difference shows the asymmetric
nature of the likelihood ratio.



Figure 2 Computation of the residual energies
in the log-likelihood ratio distortion measure.


Search procedure. The optimum excitation
vector is searched in two sequential steps.
First, a conventional search algorithm finds the
$n_c$ excitation vectors with the lowest
NWMSE. These $n_c$ candidates are used in the
second stage in order to minimize the mixed
distortion measure.

The convolution of the $l$-th codevector
$\mathbf{v}(l)$ with the impulse response of the
weighted synthesis filter can be written in
matrix form as,

$$ \hat{\mathbf{z}}(l) = G \, \mathbf{H} \mathbf{v}(l) \qquad (6) $$

where $G$ is the gain, $\mathbf{H}$ is a lower
triangular Toeplitz matrix containing the
impulse response in its first column and $K$ is
the vector dimension,

$$ \mathbf{H} = \begin{bmatrix} h(0) & 0 & \cdots & 0 \\ h(1) & h(0) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h(K-1) & h(K-2) & \cdots & h(0) \end{bmatrix} \qquad (7) $$
The NWMSE is given by,

$$ \epsilon_w(l) = \left\| \mathbf{z} - \hat{\mathbf{z}}(l) \right\|^2 = \| \mathbf{z} \|^2 - 2 G \, \mathbf{z}^T \mathbf{H} \mathbf{v}(l) + G^2 \, \mathbf{v}^T(l) \mathbf{H}^T \mathbf{H} \mathbf{v}(l) \qquad (8) $$

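Setting the derivative with respect to $G$ to
zero gives the optimum gain and the minimum
NWMSE for the $l$-th codevector,

$$ \frac{\partial \epsilon_w(l)}{\partial G} = -2 \, \mathbf{z}^T \mathbf{H} \mathbf{v}(l) + 2 G \, \mathbf{v}^T(l) \mathbf{H}^T \mathbf{H} \mathbf{v}(l) = 0 $$

$$ G_{opt}(l) = \frac{\mathbf{z}^T \mathbf{H} \mathbf{v}(l)}{\mathbf{v}^T(l) \mathbf{H}^T \mathbf{H} \mathbf{v}(l)} $$

$$ \epsilon_{opt}(l) = \| \mathbf{z} \|^2 - \frac{\left( \mathbf{z}^T \mathbf{H} \mathbf{v}(l) \right)^2}{\mathbf{v}^T(l) \mathbf{H}^T \mathbf{H} \mathbf{v}(l)} \qquad (9) $$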

In order to find the $n_c$ best excitation
vectors out of the $L$-level codebook it is
necessary to maximize the second term of the
minimum error:

$$ \text{find } l_{opt}: \quad \max_{l = 0, \ldots, L-1} \frac{\left( \mathbf{z}^T \mathbf{H} \mathbf{v}(l) \right)^2}{\mathbf{v}^T(l) \mathbf{H}^T \mathbf{H} \mathbf{v}(l)} \qquad (10) $$

The computational complexity of the search is
reduced by using a shift-symmetric codebook
[8].
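
The first search stage described above can be
sketched as follows. This is a minimal
illustration, not the coder's implementation: it
uses an exhaustive search over a tiny hypothetical
codebook, an unquantized gain, and ignores the
shift-symmetric structure that the paper exploits
to save computation.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def filter_codevector(h, v):
    """Zero-state response H v: convolution of v with h, truncated to K samples."""
    return [sum(h[n - k] * v[k] for k in range(n + 1)) for n in range(len(v))]

def first_stage_search(z, h, codebook, n_c):
    """Return the n_c best (error, index, gain) triples, lowest error first."""
    z_energy = dot(z, z)
    results = []
    for l, v in enumerate(codebook):
        y = filter_codevector(h, v)        # H v(l)
        cross = dot(z, y)                  # z^T H v(l)
        energy = dot(y, y)                 # v^T(l) H^T H v(l)
        if energy == 0.0:
            continue                       # skip an all-zero response
        gain = cross / energy              # optimum gain for this codevector
        error = z_energy - cross * cross / energy  # minimum NWMSE
        results.append((error, l, gain))
    results.sort()
    return results[:n_c]

# Hypothetical data: K = 3, a 4-entry codebook, and a target built
# from entry 2 scaled by a gain of 2.
h = [1.0, 0.5, 0.25]
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, -1.0, 0.0], [1.0, 1.0, 1.0]]
z = [2.0 * y for y in filter_codevector(h, codebook[2])]

best = first_stage_search(z, h, codebook, n_c=2)
```

Minimizing the error is the same as maximizing
the ratio in equation 10, since the target energy
is common to all codevectors.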


In the second stage of the search, the
optimum codevector is chosen out of the $n_c$
candidates. The objective is to choose a
codevector that minimizes the distortion between
the original LPC model $A(z)$ and the
backward-adaptive LPC model $\hat{A}(z)$ one
vector into the future. The original LPC model
has already been computed for the weighting
filter. To calculate the corresponding
$\hat{A}_i(z)$ for each candidate, we compute
the next block of speech samples and perform
the LPC analysis on the updated synthesized
speech sequence. The mixed distortion is
defined as,

$$ d_{mix}(i) = d_{LLR}\left(A(z), \hat{A}_i(z)\right) + \eta \log\left(\frac{\epsilon_i}{\epsilon_{min}}\right), \qquad i = 1, \ldots, n_c \qquad (11) $$

where $\epsilon_i$ is the NWMSE of the $i$-th
candidate, $\epsilon_{min}$ is the minimum NWMSE
of the candidates and $\eta$ is a parameter to be
optimized in subjective tests. In equation 11,
as $\eta$ goes to infinity the mixed distortion
measure becomes equivalent to the NWMSE. On the
other hand, as $\eta$ approaches zero the LPC
distortion of future frames decreases at the
cost of accuracy in the current block of samples.
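
The LLR of equation 5 and the second-stage
selection of equation 11 can be sketched as
follows. Several simplifications are assumptions
made for illustration only: the LPC analysis here
uses the autocorrelation method with a
Levinson-Durbin recursion (the coder itself uses
the modified covariance method), the predictor
order is 2, and the test frame and candidate
values are hypothetical.

```python
import math

def autocorr(x, p):
    """Autocorrelation lags r[0..p] of the frame x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(p + 1)]

def levinson(r):
    """Return (a, alpha) with A(z) = 1 - sum_k a[k-1] z^-k and alpha = r0 - a^T r."""
    a, err = [], r[0]
    for i in range(1, len(r)):
        acc = r[i] - sum(a[j] * r[i - 1 - j] for j in range(i - 1))
        k = acc / err
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [k]
        err *= 1.0 - k * k
    return a, err

def residual_energy(a_hat, r):
    """beta = r0 - 2 a_hat^T r + a_hat^T R a_hat, with R[i][j] = r[|i - j|]."""
    p = len(a_hat)
    cross = sum(a_hat[i] * r[i + 1] for i in range(p))
    quad = sum(a_hat[i] * a_hat[j] * r[abs(i - j)]
               for i in range(p) for j in range(p))
    return r[0] - 2.0 * cross + quad

def llr(a_hat, r, alpha):
    """Log-likelihood ratio distortion log(beta / alpha)."""
    return math.log(residual_energy(a_hat, r) / alpha)

def second_stage(candidates, eta):
    """candidates: (index, nwmse, d_llr) triples; returns the chosen index."""
    eps_min = min(eps for _, eps, _ in candidates)
    def d_mix(c):
        return c[2] + eta * math.log(c[1] / eps_min)
    return min(candidates, key=d_mix)[0]

# Hypothetical "original" frame of 104 samples (4 blocks of 26).
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(104)]
r = autocorr(frame, 2)
a, alpha = levinson(r)

beta_same = residual_energy(a, r)                    # equals alpha
d_perturbed = llr([0.9 * c for c in a], r, alpha)    # mismatched model

# Hypothetical candidates: (codebook index, NWMSE, LLR of its future model).
cands = [(3, 1.00, 0.30), (7, 1.10, 0.05), (12, 1.50, 0.02)]
chosen = second_stage(cands, eta=1.0)
```

With these made-up numbers the search keeps the
lowest-NWMSE candidate when eta is large, the
lowest-LLR candidate when eta is zero, and a
compromise in between.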


SIMULATION RESULTS 

Computer simulation results were obtained
for the BA-CELP coder and for a conventional
forward-adaptive version. The conventional 8
Kb/s CELP coder performed the LPC analysis on
20 ms of buffered speech. The autocorrelation
method was used to calculate the LPC
coefficients, which were transformed to line
spectrum pairs and quantized. For the BA-CELP
coder we used the mixed distortion parameters
$\eta = 1$ and $n_c = 16$. For these values, the
NWMSE was greater than the minimum in 20% of
the speech blocks. The shift-symmetric
excitation codebook was optimized using a
10-minute speech database.

Formal listening tests were conducted
following the CCITT recommendations in [9]. The
stimulus material contained six different
sentences spoken by different male and female
speakers. Twenty listeners evaluated speech
quality under five different conditions: 2
coders and 3 references. The reference
conditions consisted of the original speech
(PCM 64 Kb/s) and speech corrupted with random
noise whose amplitude is proportional to the
instantaneous signal amplitude. The distorted
speech is specified according to the modulated
noise reference unit (MNRU) [10]. Mean opinion
scores and 95% confidence intervals are shown in
table 2. According to our results, speech
quality in the forward-adaptive coder is only
about 0.1 points better on the MOS scale than in
the BA-CELP coder (3.42 versus 3.33), a
difference smaller than the 95% confidence
interval of either score.


Condition        Mean    Error

PCM 64 Kb/s      4.24    0.15
Forward          3.42    0.20
Backward         3.33    0.19
MNRU 20 dB       2.39    0.18
MNRU 15 dB       1.68    0.17


Table 2 Mean opinion score test.
Mean and 95% confidence intervals.
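
As a quick arithmetic check on table 2, the
sketch below compares the two coders' scores. It
assumes that the "Error" column is the half-width
of the 95% confidence interval, which the table
does not state explicitly, and overlapping
intervals are only an informal indication, not a
significance test.

```python
# (mean opinion score, assumed half-width of the 95% interval)
mos = {
    "Forward": (3.42, 0.20),
    "Backward": (3.33, 0.19),
}

fwd_mean, fwd_err = mos["Forward"]
bwd_mean, bwd_err = mos["Backward"]

# Difference between the two coders on the MOS scale.
diff = fwd_mean - bwd_mean

# The intervals overlap if the lower end of the higher score
# does not exceed the upper end of the lower score.
intervals_overlap = (fwd_mean - fwd_err) <= (bwd_mean + bwd_err)

print(f"difference = {diff:.2f}, overlap = {intervals_overlap}")
```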
Figure 3 shows the noise-weighted
signal-to-noise ratio (NWSNR) as a function of
$\eta$ for a 30-second segment of speech. The
dashed line represents the NWSNR for the
conventional search (no LLR distortion). Observe
that for values of $\eta$ between 0.2 and 3 the
global NWSNR is greater than for the regular
search. This shows how the global NWMSE was
reduced by using the sub-optimal (in an NWMSE
sense) mixed distortion measure. In other words,
an increase in the error of one block of samples
helps in filter tracking and therefore improves
the overall performance of the coder. Figure 4
shows the log-likelihood ratio distortion
measure for different values of $\eta$.




Figure 4 Log-likelihood ratio distortion vs $\eta$.


CONCLUDING REMARKS 

In this paper we discussed how a delay of
approximately 13 ms is achieved in the BA-CELP
coder. Based on subjective MOS tests, speech
quality has been found to be comparable to that
of conventional forward-adaptive CELP coders.
However, several LPC analyses are necessary
to compute the mixed distortion measure. The
number of candidates in the search procedure
determines the computational complexity of the
coder. Further reductions in complexity may be
possible by using recursive LPC algorithms
instead of block algorithms. To further reduce
the delay, future work could investigate
backward adaptation of the remaining
parameters.